Use of this dataset should cite the Bureau of Labor Statistics as per their copyright information: The Bureau of Labor Statistics (BLS) is a Federal government agency and everything that we publish, both in hard copy and electronically, is in the public domain, except for previously copyrighted photographs and illustrations. You are free to use our public domain material without specific permission, although we do ask that you cite the Bureau of Labor Statistics as the source.
Median weekly earnings of full-time wage and salary workers by detailed occupation and sex.
Occupation: Job title as given from BLS. Industry summaries are given in ALL CAPS.
All_workers: Number of workers male and female, in thousands.
All_weekly: Median weekly income including male and female workers, in USD.
M_workers: Number of male workers, in thousands.
M_weekly: Median weekly income for male workers, in USD.
F_workers: Number of female workers, in thousands.
F_weekly: Median weekly income for female workers, in USD.
In [194]:
import pandas as pd
from pandas import DataFrame, Series
We assign more meaningful and self explanatory column headers. The names list can be assigned to the names parameter when reading in the csv file.
In [195]:
names = ['Occupations','All_workers', 'Weekly_Income_Overall',
'Male_Workers (in 1000s)', 'Median_weekly_income (Male)',
'Female_Workers (in 1000s)', 'Median_weekly_income (Female)']
In [196]:
#mac
income = pd.read_csv('/Users/Harish/Documents/HK_Work/Python/Python-for-Data-Analysis/Kaggle-US-Incomes/inc_occ_gender.csv'
,names=names, header=None)
#win
#income = pd.read_csv(r'C:\Users\\Documents\Personal\Python-for-Data-Analysis\Kaggle-US-Incomes\inc_occ_gender.csv'
# ,names=names, header=None)
To manipulate and perform operations on numerical columns, we will need to handle null/empty/non-numerical values in the numerical columns. In our dataset, this means everything except the Occupations column.
In [197]:
income.head()
Out[197]:
The original header that was part of the csv can be dropped
In [198]:
income.drop(0, inplace=True)
In [229]:
income['income_index'] = income.index
The following looks more readable
In [232]:
income.head()
Out[232]:
Now, we will slice the dataset to understand more about the data
In [200]:
#selecting just the occupations and worker count into a new dataframe
job_type_by_worker_count = income[['Occupations','All_workers']]
In [201]:
job_type_by_worker_count.sort_values(by='All_workers', ascending =False)
Out[201]:
The sorting did not happen by the All_workers column. Let's see why that is the case.
In [202]:
income.dtypes
Out[202]:
The columns are objects. We will need to convert them to primitive types, such as integer, string etc.
In [203]:
job_type_by_worker_count2 = income.apply(pd.to_numeric, errors = 'ignore')
In [204]:
job_type_by_worker_count2.dtypes
Out[204]:
The conversion happened for All_workers, Male_Workers (in 1000s), Female_Workers (in 1000s), but not for the other columns. We will see why.
In [205]:
job_type_by_worker_count2.head(n=15)
Out[205]:
The columns for which the type conversion did not happend contain 'Na' values in them. These look like NULL or NA values, but actually, they are strings with the value 'Na'. We can work around these values by replacing them with 0.
In [206]:
income.replace('Na',0,inplace=True)
In [207]:
income.head(n=15)
Out[207]:
This looks better, without the Na string values. Now, we will convert all the columns, except Occupations to int. Occupations is a string column and we can leave it as an object type.
In [208]:
income_clean = income.apply(pd.to_numeric, errors = 'ignore')
In [209]:
income_clean.dtypes
Out[209]:
We will now slice the data and analyze the count of workers in each occupation.
In [210]:
top10_worker_count = income_clean.sort_values(by='All_workers', axis=0,
ascending=False)[['Occupations','All_workers',
'Male_Workers (in 1000s)',
'Female_Workers (in 1000s)']]
In [211]:
top10_worker_count.drop(1, axis=0, inplace=True)
The first row shows the total number of workers. We will drop the row since it is not useful for our analysis and will skew the overall pattern.
In [212]:
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline
In [228]:
top_worker_count_barplot = sb.factorplot(x='Occupations',
y='All_workers', data=top10_worker_count.head(n=5), hue='All_workers',
size=7, aspect=2.3,kind='bar')
top_worker_count_barplot.set(xlabel = "Occupations", ylabel = "Workers - All, Male, Female",
title = "Top worker counts and the occupations")
sb.set(font_scale=2.3)
We have a plot showing top employee count for 5 occupations. We will append gender wise numbers on the same plot. For this purpose, we use pandas' melt function. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html
In [ ]:
top10_worker_count.head()
In [ ]:
top10_worker_count = pd.melt(top10_worker_count, id_vars=["Occupations"],
value_vars=['All_workers','Male_Workers (in 1000s)','Female_Workers (in 1000s)'])
In [ ]:
top10_mf_count = income_clean.sort_values(by='All_workers', axis=0,
ascending=False)[['Occupations',
'Male_Workers (in 1000s)',
'Female_Workers (in 1000s)']]
In [ ]:
top10_mf_count.drop(1, axis=0, inplace=True)
In [ ]:
top10_male_female = pd.melt(top10_mf_count, id_vars=["Occupations"],
value_vars=['Male_Workers (in 1000s)','Female_Workers (in 1000s)'])
In [ ]:
top10_worker_count.replace('NaN',0,inplace=True)
top10_male_female.replace('NaN',0,inplace=True)
In [ ]:
sorted_worker_count_n = top10_worker_count.sort_values(by='value', axis=0,
ascending=False)[['Occupations','variable','value']]
In [ ]:
sorted_male_female_count_n = top10_male_female.sort_values(by='value', axis=0,
ascending=False)[['Occupations','variable','value']]
In [ ]:
sorted_worker_count_n.head(n=5)
In [214]:
sorted_male_female_count_n.head(n=5)
Out[214]:
In [222]:
sorted_worker_count_n_barplot = sb.factorplot(x='Occupations',
y='value', data=sorted_worker_count_n.head(n=10), hue='variable',
palette="bright",
size=10, aspect=5,kind='bar')
sorted_worker_count_n_barplot.set(xlabel = "Occupations", ylabel = "Workers - All, Male, Female",
title = "Top worker counts and the occupations")
sb.set(font_scale=4.0)
In the above plot, we see that there are some professions involving only male employees - such as pales, production, transportation, construction etc. Similarly, there are occuapations that involve only female employees - such as healthcare professionals, education etc. This is just a sample of 20 occupations. We can analyze more, as seen below.
In [ ]:
sorted_male_female_count_n_barplot = sb.factorplot(x='Occupations',
y='value', data=sorted_male_female_count_n.head(n=10), hue='variable',
palette="bright",
size=12, aspect=5,kind='bar')
sorted_male_female_count_n_barplot.set(xlabel = "Occupations", ylabel = "Workers - All, Male, Female",
title = "Top male, female employee counts and the occupations")
sb.set(font_scale=4)
We can observe a similar trend in the above plot as well, where transportation, construction, production, maintenance and groundskeeping involve male employees and education and healthcare involves female employees. We can also see that business, office, sale etc. have both female and male employees.
Now, let us sort the Occupations by weekly median income, irrespective of gender of employees
In [ ]:
top10_income_count = income_clean.sort_values(by='Weekly_Income_Overall', axis=0,
ascending=False)[['Occupations',
'Weekly_Income_Overall']]
low10_income_count = income_clean.sort_values(by='Weekly_Income_Overall', axis=0,
ascending=True)[['Occupations',
'Weekly_Income_Overall']]
In [227]:
top10_income_count_barplot = sb.factorplot(x='Occupations',
y='Weekly_Income_Overall', data=top10_income_count.head(n=5), hue='Occupations',
palette="bright",
size=12, aspect=5, kind='bar')
top10_income_count_barplot.set(xlabel = "Occupations", ylabel = "Weekly Income",
title = "Top Weekly Incomes and the respective occupations")
sb.set(font_scale=2)
#size=12, aspect=5,
In [160]:
low10_income_count = low10_income_count[(low10_income_count != 0).all(1)]
In [162]:
# lowest10_income_count_barplot = sb.factorplot(x='Weekly_Income_Overall',
# y='Occupations', data=low10_income_count.head(n=10), hue='Occupations',palette="bright",
# size=12, aspect=5, kind='bar')
# lowest10_income_count_barplot.set(xlabel='Occupations', ylabel = 'Weekly Income',
# title = 'Lowest Weekly Incomes and the respective occupations')